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ABSTRACT 

The problem of recovering a signal from the magnitude of its short-time Fourier transform (STFT) is a 
longstanding one in audio signal processing. Existing approaches rely on heuristics which often perform 
poorly because of the nonconvcxity of the problem. We introduce a formulation of the problem that lends 
itself to a tractable convex program. We observe that our method yields better reconstructions than the 
standard GrifHn-Lim algorithm. We provide an algorithm and practical implementation details, including a 
discussion of how the method can be scaled up to larger examples. 



1. INTRODUCTION 

The problem of estimating a signal from the magni- 
tude of its short-time Fourier transform (STFT) has 
vexed audio signal processing researchers for decades 
[1, 2, 3]. The fundamental challenge can be summa- 
rized as: 



Fourier coefficients are complex numbers, 
hut in many applications we only know 
their magnitudes and not their phases. 

For example, in many frequency-domain audio pro- 
cessing techniques such as time-scale modification 



and speech enhancement, it is typically straightfor- 
ward to modify the magnitudes of the Fourier co- 
efficients but not their phases. In the case of time 
stretching, it is clear that the energy (i.e., magni- 
tudes) shoiild be spread over more time bins, but 
unclear how to specify the phases accordingly. 

The absence of phase information is problematic 
when we want to reconstruct the modified signal, 
for which it is necessary to invert the STFT — an 
operation which requires the full complex coeffi- 
cients (i.e., both magnitudes and phases). Hence, 
the magnitude-only reconstruction problem is also 
termed phase retrieval in the literature. Using in- 
compatible phase information in the reconstruction 
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can result in amplitude-modulation distortion [4]. 

Recently, interest in the magnitude-only reconstruc- 
tion problem has resurged in the source separation 
community, where matrix factorization has become 
a standard workhorse [5]. Since these approaches 
operate entirely on the magnitude spectrogram, the 
same difficulties arise in reconstructing the time- 
domain signals. A heuristic that is commonly used 
in this literature is to set the phases of each of the 
recovered signals equal to the phases of the origi- 
nal mixture signal [6]. Some justification has been 
offered for this approach; binder certain modeling as- 
sumptions, it is a consequence of the Wiener filter 
reconstruction [7]. 

Griffin and Lim [8] proposed a more general solution 
to the magnitude-only reconstruction problem. As 
their approach is now the standard, we briefly review 
their algorithm in the next section so that we may 
later contrast it with our approach. 

2. EXISTING APPROACHES 

Let |yj„(?Tii?, a;)| denote the desired magnitude 
STFT, where R is the hop size, w denotes the win- 
dow, and m and ui index the time frame and fre- 
quency, respectively. Our goal is to find a signal ,t, 
whose magnitude STFT \Xw{mR,uj)\ is as close to 
the desired as possible, in a least squares sense: 



oo 



/' 

J u}=—7r 



\Xyj{mR^uj)\ — \Yw{mR^uj)\) dco 



(1) 

The Griffin-Lim algorithm attempts to minimize this 
objective by starting with an initial guess for the sig- 
nal x^^^ and iterating the following steps until con- 
vergence: 

1. Compute the STFT of the current signal esti- 
mate x*^'): 

oo 



(2) 

Retain the phases of Xw , but replace its mag- 
nitudes by the desired magnitr 

Xi^+^\mR,uj) := \Y^{mR,uj) 



nitudes by the desired magnitudes: 

X^\mR,u;) 



X^\mR,co)\ 



2. Compute the IDTFT of X^^'^^ {mR, •), denoted 
^wXnW each m,} Then, find a signal x such 
that the windowed version of x, Xw,mRi is close 
to Xw,mR for each m in a least-squares sense. 
The solution turns out to be: 



Eoo 
m— - 



,w(mi?-n)x^+!],(n) 



(5) 



Griffin and Lim showed that this algorithm decreases 
the objective (1) on each iteration and converges to 
a stationary point. However, the objective is non- 
convex, so different initializations .x^"' can lead to 
different solutions, and there are no guarantees on 
convergence to a global optimum. In fact, as we 
will show, the algorithm often fails to recover a sig- 
nal from its true magnitude spectrogram, even when 
several initializations are used. 

It is worth noting that the magnitude-only recon- 
struction problem has been formulated in other 
ways: as a probabilistic model [9] and as a root- 
finding problem [10]. However, these methods suf- 
fer from the same drawbacks as the Griffin-Lim 
algorithm — namely, that they try to solve a noncon- 
vex problem and hence are prone to local optima. 
To our knowledge, the algorithm that we describe in 
the next section is the first convex formulation for 
this problem. 

3. CONVEX FORMULATION 

The Griffin-Lim algorithm resembles a number of 
earlier alternating projections algorithms [11] for 
solving the basic phase retrieval problem (i.e., where 
the magnitudes are of only a single Fourier trans- 
form). Recently it was shown in [12] that this basic 
problem could be recast in terms of estimating a 
rank-1 matrix X which is the outer product of the 
signal with itself, i.e., X = xx* . Once X is obtained, 
the putative signal can be recovered by factorizing 
X, provided that it is rank one. 



^Ideally we would like for 
sion of some signal x, i.e. 

-(i+i) 



to be the windowed ver- 



m«(") = - n)x{n). 



(4) 



To see that this is not always possible, consider the case where 
x^^rnRin) ^ but w{mR — n) = 0. Furthermore, we would 



(3) need a single signal x such that (4) holds for all m. 
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This technique of lifting the problem of estimating 
a vector x to estimating a matrix X = xx* is stan- 
dard in the optimization hterature [13]. Although it 
drastically increases the dimensions of the problem, 
it leads naturally to a tractable convex program, as 
we shall see. In this section, we formulate this prob- 
lem in the context of the STFT. In the spirit of [12], 
who term their algorithm PhaseLift, we call ours 
STliFT. 

First, wc consider the problem of recovering a signal 
from its true magnitude spectrogram (i.e., when a 
signal with that exact magnitude spectrogram ex- 
ists) . Suppose we use an FFT of length TV for each 
frame. Letting Sfc(n) = e^^^fe/w^ k = 0,...,N - 1 
denote the sampled complex sinusoids corresponding 
to this FFT length, the fc"^ Fourier coefficient for the 
rn'"''' time frame is {x„,._rnR, Sk), where Xw,mB. denotes 
the windowed signal a;^^„ij(n) = w{mR — n)x{n), 
zero-padded to the FFT length A''. Since we want 
the magnitude of each of these coefficients to be 
\Yw{mR,uJk)\^ the problem can be recast as: 

find X 

subject to \{xw,mR,Sk)\ = \Yy,{mR,LOk)\ (6) 
0<A;<A^, 0<m<T 

Next, we note that we can write the windowed signal 
Xw.mR as a matrix product W^rx, where W^r is of 
the form: 



mR 



( 



\ 



w{- 



N 
' 2 



1) 



-(f) 



Now we square both sides of the equality constraint. 
Using the fact that Xw,mR = WmRX, we can write 
the left-hand side as 



%R, Sk) 



iT^iiWmRx)'^ SksKWmRx)) 

t^m^RSk){w^RSkr xx^) 



Sk,; 



X 



where in the second step we have used the trace iden- 
tity tr(^i3) = tT{BA) to group x and x'^ into a single 
rank-1 matrix X. Hence the problem becomes: 



find X 

subject to rank(X) ~ 1, X y 

tr{Sk,mX) = \Y^^{mR,u!k)f 
0<k<N, 0<m<T 



(7) 



This is equivalent to the original problem and thus 
also nonconvex, except now the nonconvexity plainly 
arises from a rank constraint. We can relax (7) to 
a convex problem by replacing the rank constraint 
with its convex surrogate, a trace minimization: 



minimize ti{X) 
subject to X yO 

tr(5fc,™X) = \Y^{mR,ujk)\^ 
0<k<N, 0<m<T 



(8) 



This problem is not only convex; it is in fact the dual 
semidefinite program (SDP) described in [14]. 

Now we consider the case where there is not neces- 
sarily any signal with the specified magnitude spec- 
trogram, and we wish to find a signal whose mag- 
nitude spectrogram comes as close to the specifi- 
cation as possible. Because the above problem is 
formulated for the power spectrogram rather than 
the magnitude spectrogram, it cannot readily be 
adapted to minimize the distance between the mag- 
nitude spectrograms as in (1). However, it is 
straightforward to extend the above framework to 
minimize the distance between the power spectro- 
grams, by solving 

minimize ^(tr(S'fc,„X) - |F^(mJ?, 
+ A • tr(X) 

subject to X 

(9) 

where A controls the tradeoff between minimizing 
the distance between the spectrograms and the low- 
rank constraint on X. In practice, we initialize A 
large enough so that the solution is X = and solve 
(9) repeatedly for decreasing values of A until we 
obtain an (approximately) rank-1 solution. 

Before concluding this discussion of the model, we 
comment that Euclidean distance may be inappro- 
priate as a measure of divergence between the esti- 
mated and desired spectrogram. Euclidean distance 
is symmetric on a linear scale, so if the desired power 
in a time- frequency bin is 0.1, then Euclidean dis- 
tance would favor a reconstruction which assigns a 
power of 0.01 to that bin, over one which assigns 
0.2. However, the human ear perceives energy on 
a logarithmic scale; 0.01 represents a discrepancy 
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of 20 dB, whereas 0.2 represents one of only 6 dB. 

For these reasons, logarithmic divergences such as 
Kullback-Leibler or Itakura-Saito may be preferable 
to Euclidean distance for audio [7]. 

For purposes of comparison with Griffin-Lim, we 

only deal with the Euclidean case in this work. 
However, our formulation above can be readily ex- 
tended to these other divergences by simply replac- 
ing {ti{Sk,mX) - \Y^{mR,uJk)\'^)'^ by the appropri- 
ate divergence D{tT:{Sk,mX),\yw{mR,<^k)\^)- Such 
problems can be solved using the algorithm de- 
scribed in the next section, by modifying the gra- 
dient Vi appropriately. 

4. PRACTICAL IMPLEMENTATION 

In this section, we describe a practical algorithm for 
solving (9). Although the problem can in principle 
be solved using standard solvers, the scale of the 
problem (X is an n x n matrix, where n is the length 
of the signal) makes this all but impossible for any 
real- world problem. 

A projected gradient descent algorithm was pro- 
posed to solve the PhaseLift problem [12]. Letting 
L{X) denote the objective function in (9), the algo- 
rithm starts with an initial guess Xq and iterates: 

Xfc+i = r{Xk - tkVL{Xk)) 

where V denotes projection onto the feasible set X y 
0, VL the gradient, and tk a "suitable" step size. V 
performs an eigendecomposition on the matrix X 
and thresholds all negative eigenvalues to 0. 

For Euclidean distance as in (9), the gradient is: 

VL(X) = 2Y,itv{Sk,„^X)-\YUmR,L0k)\^)Sk,m+XI 

We emphasize that this is a conceptual, rather than 
a practical, formula. In practice, the structure of 
each Sk,m (its nonzero entries form a block within 
the nxn matrix) should be exploited for fast compu- 
tation. Such optimizations are implemented in the 
Matlab code for this algorithm, which will be made 
available at the first author's webpage.^ 

Although gradient tk^Hcent scales well to large prob- 
lems, it can converge slowly for ill-conditioned prob- 
lems, where the direction of steepest descent can be 

^http : //www-stat . Stanford. edu/~dlsvm 



nearly orthogonal to the direction of the minimum. 

This results in the characteristic "zig-zag" trajectory 
of gradient descent [15]. One way to accelerate the 
convergence is to evaluate the gradient not at Xf. 
but at an auxiliary point Yk which is a function of 
the past values of X, in order to capture "momen- 
tum" of the trajectory. [12] suggests a variant of the 
FISTA acceleration scheme [16]: 

Xk+i = V{Yk - tkVL{Yk)) 

Ok+i = 2 (^1 + /l + 4/^) 

/3fc+l = (^k+l{0^. — 1) 

Ffe+i = Xk+i + MXk+i - Xk) (10) 

We have found this algorithm to scale to reasonably 
large problems and have confirmed that the acceler- 
ations defined by (10) drastically improve the perfor- 
mance of projected gradient descent in our setting. 

The final piece of the algorithm is the step size tk. 
Although tk can be chosen using a line search, eval- 
uating the objective is prohibitively expensive for 
large problems since it involves an eigendecomposi- 
tion. We have found choosing a fixed step size by 
trial and error to work well in practice. For step 
sizes that are too large, the objective values diverge 
within a few iterations, so we use the largest step 
size for which the objective values still converge. 

5. RESULTS 

First, we consider the "noiseless" problem of recov- 
ering a signal from its true magnitude STFT. We 
first generated 20 random signals each of length 
n = 16 and n = 32. We then determined the 
magnitude STFT of each signal using Hann win- 
dows of length M = 5, 7, 9, 11 for both the n = 16 
and n = 32 signals, and also windows of length 
M = 13, 15, 17, 19, 21 for the n = 32 signals. We 
considered all hop sizes that resulted in constant 
overlap-add for each window length. The FFT size 
was taken to be the same as the signal length: 
N ^n. 

We then applied the Griffin-Lim algorithm and the 
program (8) to each magnitude spectrogram to ob- 
tain a signal estimate. For GrifFin-Lim, we used 10 
random initializations and retained only the "best" 
(see below) of the 10 initializations. To evaluate 
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performance, we tabulated the percentage of the 20 

signals whose estimates achieved a value of (1) less 
than lO^''. Results arc presented in Table 1. 

We found that our algorithm recovers the signal 
more consistently than the GrifRn-Lim algorithm. 
Furthermore, the objective values attained by the 
latter when it did not converge to the optimum were 
much larger than 1, which empirically supports the 
claim that alternating projections algorithms have a 
tendency to get stuck in local optima for nonconvex 
problems. Recovery is a problem for both algorithms 
when the hop size is large, but especially for Griffin- 
Lim, most likely because there are fewer constraints 
with a large hop size, so the landscape is dotted with 
more local optima. 

Next, we consider a more realistic "noisy" setup in 
which there may not be any signal with the given 
magnitude STFT and we wish only to find a sig- 
nal minimizing (1). We take the same 20 signals as 
before, calculate their power spectrogram, and add 
i.i.d. Gaussian noise with mean and standard de- 
viation 0.2. (We set to zero any entries which end 
up negative as a result.) We then apply the Grifhn- 
Lim algorithm and the program (9) to this modified 
magnitude STFT, and compare the values of (1) for 
the two estimates. Results are presented in Table 2. 

Here the advantages of our algorithm are not so 
clear cut. GrifRn-Lim tends to perform better for 
large hop sizes, whereas our algorithm does better 
for smaller hop sizes. However, the two algorithms 
appear to become increasingly similar as window 
length increases. 

6. DISCUSSION 

Although the algorithm is promising on small exam- 
ples, it remains to scale it to real audio data. To 
appreciate the difficulties of scaling the algorithm, 
note that the algorithm estimates an n x n matrix, 
where n is the length of the audio clip. For a 3- 
second clip sampled at 44.1 kHz, this amounts to 
estimating a 10^ x 10^ matrix. 

We make several observations which may be useful 
for scaling the algorithm. First, we note that the al- 
gorithm is entirely parallelizable in the sense that we 
can divide the signal into smaller segments and run 
the algorithm separately on each chunk. The only 
thing that we lose is the information that would have 



n= 16 



Window 
Length (M) 


Hop 
Size (R) 


Accuracy 
(%) of G-L 


Accuracy 
(%) of (8) 


5 


2 





65 


5 


1 


30 


100 


7 


3 





50 


7 


2 


10 


100 


7 


1 


60 


100 


9 


4 


5 


60 


9 


2 


70 


100 


9 


1 


80 


100 


11 


5 


5 


60 


11 


2 


95 


100 


11 


1 


95 


100 




Tl 


= 32 




Window 

Length (M) 


Hop 

Size (R) 


Accuracy 
(%) of G-L 


Accuracy 
(%) of (8) 


5 


2 





20 


5 


1 





100 


7 


3 





15 


7 


2 





100 


7 


1 


15 


100 


9 


4 





15 


9 


2 


10 


100 


9 


1 


20 


100 


11 


5 





20 


11 


2 


55 


100 


11 


1 


55 


100 


13 


6 





55 


13 


4 





100 


13 


3 


45 


100 


13 


2 


80 


100 


13 


1 


65 


100 


15 


7 





55 


15 


2 


85 


100 


15 


1 


80 


100 


17 


8 





60 


17 


4 


60 


100 


17 


2 


80 


100 


17 


1 


75 


100 


19 


9 





60 


19 


6 





100 


19 


3 


85 


100 


19 


2 


85 


100 


19 


1 


90 


100 


21 


10 





20 


21 


5 


65 


100 


21 


4 


90 


100 


21 


2 


85 


100 


21 


1 


100 


100 



Table 1: Comparison of Griffin-Lim and program (8) 
for different configurations in the "noiseless" setup. Ac- 
curacy is the percentage of the 20 test signals for which 
the objective (1) of the solution was < 10~^. 
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Griffin-Lim 




Time (sample) 



Fig. 1: We applied thie Griffin-Lim algorithm and 
program (8) to a random signal of length n — 32 
with a window of length M = 15 and a hop size of 
R ~ 7. Shown at top is an instance where Griffin- 
Lim fails to converge to the global optimum of (1) 
and returns an unreasonable solution. Shown at 
bottom: our algorithm basically recovers the signal 
from the magnitude spectrogram, up to the numer- 
ical precision of the algorithm. 







= 16 




Window 


Hop 


Median Rel. 


Median Kel. 


Length (M) 


Size (R) 


% Err. of G-L 


% Err. of (9) 


5 


2 


1.18 


5.66 


5 


1 


0.82 


0.86 


7 


3 


1.25 


4.27 


7 


2 


1.85 


1.00 


7 


1 


0.44 


0.35 


9 


4 


1.03 


4.32 


9 


2 


0.66 


0.41 


9 


1 


0.31 


0.27 


11 


5 


0.85 


5.60 


11 


2 


0.35 


0.31 


11 


1 


0.20 


0.21 




Ti 


= 32 




Window 


Hop 


Median Rel. 


Median Kel. 


Length (M) 


Size (R) 


% Err. of G-L 


% Err. of (9) 


5 


2 


0.89 


16.7 


5 


1 


1.72 


0.22 


7 


3 


1.00 


3.91 


7 


2 


2.10 


0.24 


7 


1 


0.77 


0.12 


9 


4 


0.46 


7.80 


9 


2 


1.43 


0.11 


9 


1 


1.07 


0.06 


11 


5 


0.52 


8.65 


11 


2 


0.74 


0.09 


11 


1 


0.67 


0.05 


13 


6 


0.40 


9.82 


13 


4 


1.06 


0.15 


13 


3 


1.15 


0.12 


13 


2 


0.39 


0.08 


13 


1 


0.05 


0.04 


15 


7 


0.40 


4.62 


15 


2 


0.07 


0.07 


15 


1 


0.04 


0.04 


17 


8 


0.39 


5.77 


17 


4 


0.12 


0.11 


17 


2 


0.06 


0.06 


17 


1 


0.04 


0.04 


19 


9 


0.31 


3.08 


19 


6 


0.93 


0.14 


19 


3 


0.09 


0.08 


19 


2 


0.05 


0.05 


19 


1 


0.04 


0.03 


21 


10 


0.51 


4.83 


21 


5 


0.15 


0.14 


21 


4 


0.09 


0.09 


21 


2 


0.05 


0.05 


21 


1 


0.03 


0.03 


Table 2: Comparison of Griffin-Lim and 


program (9) 


for different configurations 


in the "noisy" 


setup. The 



median relative error (defined as (1) divided by the total 
power f. \ Yu,{mR, Wfc) over 10 test signals is shown. 
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been provided by the time frames that stretch across 
two segments. Furthermore, it may not be neces- 
sary to compute the phase over the entire signal. 
Since phase is not audible over stationary stretches, 
one could apply the algorithm only on the transient 
time frames and allow phase to "free run" elsewhere. 
Since transients are short in duration and often in- 
volve smaller windows, these problems will be much 
smaller in scale. 

7. CONCLUSION 

This paper has introduced a novel method for esti- 
mating a signal from its magnitude STFT. Although 
it remains to scale the algorithm to real-life audio ap- 
plications, our approach is highly competitive with 
existing algorithms in stylized simulations. 
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